Abstract: We propose a new gravitational based hierarchical
clustering algorithm using a kd-tree. The kd-tree generates
densely populated packets, and the clusters are found using
the gravitational force between the packets. The proposed
method yields clusters of high quality and is robust. The
algorithm is tested on a synthetic dataset and the results
are presented.

I. Introduction

Hierarchical clustering generates a hierarchical series of
nested clusters which can be graphically represented by a
tree called a "dendrogram". By cutting the dendrogram at
some level, we can obtain a specified number of clusters
[3]. Due to its nested structure, it is effective and gives
better structural information. This paper presents a new
gravity based hierarchical technique using a kd-tree. We
call our new algorithm GLHL (an anagram of the bold
letters in GravitationaL Based Hierarchical ALgorithm).
The paper is organized as follows: Section II presents a
survey of the related work. Section III gives an overview of
the hierarchical clustering algorithm and kd-trees.
Section IV presents our proposed algorithm. Section V
discusses the results and Section VI gives the conclusions.

II. Related Work

Here we review some of the pioneering methods in
hierarchical clustering. Zhang et al. proposed the BIRCH
technique [6]. It overcomes two limitations of
agglomerative clustering: poor scalability and the inability
to undo what was done in a previous step [1] [6]. It is
designed for clustering large datasets; it is incremental and
hierarchical, and it can handle outliers. It introduces the
concept of a clustering feature (CF), which contains
information regarding a cluster. A CF tree is created from
the CFs, and it contains CF information about its
subclusters [2] [6]. BIRCH applies only to numeric data [2],
and it does not perform well if the clusters are not
spherical in nature [1]. Jiang et al. proposed DHC (density
based hierarchical clustering) [7]. It uses two data
structures: a density tree, which is used to uncover the
embedded clusters, and an attraction tree, which is used to
explore the inner structure of the clusters, the boundaries
of the clusters and the outliers.



III. Overview of agglomerative clustering algorithm and kd-tree

A. Agglomerative hierarchical clustering algorithm 

Agglomerative hierarchical algorithms begin with each
data object as an individual cluster. At each step, the two
most similar clusters are merged. After each merge, the total
number of clusters decreases by one. These steps are
repeated until the desired number of clusters is obtained or
the distance between the two closest clusters is above a
certain threshold distance [5].
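
The merge loop described above can be sketched as follows. This is a minimal illustration, not the implementation used in this paper: the function names are hypothetical, the similarity between two clusters is assumed to be the Euclidean distance between their centroids, and only the "desired number of clusters" stopping rule is shown.

```python
import math

def centroid(cluster):
    """Mean of a list of equal-length coordinate tuples."""
    dims = len(cluster[0])
    return tuple(sum(p[d] for p in cluster) / len(cluster) for d in range(dims))

def agglomerate(points, k):
    """Merge the two closest clusters until only k clusters remain."""
    clusters = [[p] for p in points]  # every point starts as its own cluster
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Assumed similarity measure: distance between centroids.
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge cluster j into cluster i
        del clusters[j]                          # one cluster fewer per merge
    return clusters
```

Calling `agglomerate(points, 2)` on two well-separated groups of points returns those two groups. A threshold-based stopping rule could replace the `len(clusters) > k` test by comparing the smallest distance found against a chosen limit.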

B. kd-tree

A kd-tree is a geometrical, top-down hierarchical tree data
structure. At the root, the whole data space is divided by a
vertical line into two subsets of roughly equal size. The
splitting line is stored at the root; the data points on the
left are assigned to the left subtree and the data points on
the right are assigned to the right subtree. At the next
level, each node is partitioned again along the alternate
direction (e.g., if the previous level was partitioned by a
vertical line, this level is partitioned by a horizontal
line). At each partitioning, data points to the left of or on
a vertical line are assigned to the left subtree and the rest
to the right subtree. The process continues until the
criterion function has converged [8].
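
A minimal sketch of this construction for two-dimensional points is given below. The names `build_kd_tree` and `leaves` are hypothetical, and the stopping criterion is assumed to be a maximum number of points per leaf (`leaf_size`), since the criterion function is not specified here.

```python
def build_kd_tree(points, depth=0, leaf_size=1):
    """Recursively split 2D points, alternating vertical and horizontal lines."""
    if len(points) <= leaf_size:
        return {"leaf": True, "points": list(points)}
    axis = depth % 2  # 0: vertical line (split on x), 1: horizontal line (split on y)
    ordered = sorted(points, key=lambda p: p[axis])
    split = ordered[(len(ordered) - 1) // 2][axis]  # median coordinate stored at the node
    left = [p for p in ordered if p[axis] <= split]  # on or to the left of the line
    right = [p for p in ordered if p[axis] > split]
    if not right:
        # No point lies strictly beyond the median on this axis;
        # stop here so that the recursion always terminates.
        return {"leaf": True, "points": list(points)}
    return {"leaf": False, "axis": axis, "split": split,
            "left": build_kd_tree(left, depth + 1, leaf_size),
            "right": build_kd_tree(right, depth + 1, leaf_size)}

def leaves(node):
    """Collect the point lists stored in the leaves, left to right."""
    if node["leaf"]:
        return [node["points"]]
    return leaves(node["left"]) + leaves(node["right"])
```

For four points with distinct x and y coordinates, the root stores a vertical splitting line and `leaves` returns four single-point lists.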

IV. The proposed gravitational based hierarchical algorithm

The proposed algorithm uses a kd-tree to divide the data
space into regions called leaf buckets and calculates the
density of each leaf bucket. Further, the mean of each leaf
bucket is calculated and treated as the center of gravity of
that leaf bucket. Next, an object (leaf bucket) with high
density attracts other objects (leaf buckets) with lower
density [7]. Using the gravity function, we can calculate the
attraction force between two objects (leaf buckets): the
degree of gravity is directly proportional to the product of
the densities of the two objects (leaf buckets) and inversely
proportional to the square of their distance [9].

A. Calculation of density of a leaf bucket 

Consider a set of n data points $(o_1, o_2, \ldots, o_n)$
occupying a t-dimensional space. Each point $o_i$ has
associated with it a coordinate
$(o_{i1}, o_{i2}, \ldots, o_{it})$. There exists a bounding box
(bucket) which contains all data points and whose extrema
are defined by the maximum and minimum coordinate
values of the data points in each dimension [Figure 1a]. The
data is then divided into two sub-buckets by splitting the
data along the median value of the coordinates of the parent
bucket, so that the number of points in each bucket remains
roughly the same. The division process is repeated
recursively on each sub-bucket until leaf buckets are
created, where each leaf bucket contains a single data point
[Figure 1b-1c]. The density of each leaf bucket is calculated
from the bucket size and the number of data points present
in that bucket (here each leaf bucket contains a single data
point, but the bucket size differs). If more data points are
present, the density is likely to be higher [11].
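
The bucket construction and density calculation can be illustrated as follows. This is a sketch under stated assumptions rather than the implementation used in this paper: density is assumed to be the number of points divided by the bucket volume, the splitting value is assumed to be the midpoint between the two middle points along the current axis, and all function names are hypothetical.

```python
def bounding_box(points):
    """Minimum and maximum coordinate of the points in each dimension."""
    dims = range(len(points[0]))
    lo = tuple(min(p[d] for p in points) for d in dims)
    hi = tuple(max(p[d] for p in points) for d in dims)
    return lo, hi

def leaf_buckets(points, lo, hi, depth=0):
    """Return (points, lo, hi) for each leaf bucket of a median-split partition."""
    if len(points) <= 1:
        return [(list(points), lo, hi)]
    axis = depth % len(lo)
    ordered = sorted(points, key=lambda p: p[axis])
    mid = (len(ordered) - 1) // 2
    # Assumption: split halfway between the two middle points on this axis.
    split = (ordered[mid][axis] + ordered[mid + 1][axis]) / 2
    left = [p for p in ordered if p[axis] <= split]
    right = [p for p in ordered if p[axis] > split]
    if not right:  # the points cannot be separated on this axis; keep them together
        return [(list(points), lo, hi)]
    left_hi = tuple(split if d == axis else hi[d] for d in range(len(hi)))
    right_lo = tuple(split if d == axis else lo[d] for d in range(len(lo)))
    return (leaf_buckets(left, lo, left_hi, depth + 1)
            + leaf_buckets(right, right_lo, hi, depth + 1))

def density(bucket):
    """Assumed density: number of points divided by the bucket volume."""
    pts, lo, hi = bucket
    volume = 1.0
    for a, b in zip(lo, hi):
        volume *= (b - a)
    return len(pts) / volume if volume > 0 else float("inf")
```

With this definition, a single point sitting in a small leaf bucket receives a higher density than a single point in a large leaf bucket, which matches the remark that the bucket size differs while the point count does not.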




Figure 1a    Figure 1b    Figure 1c

Figure 1: Implementation of kd-tree

B. Calculation of force of attraction 

The force of gravity is given by $F = G \frac{m_1 m_2}{r^2}$,
where $m_1$, $m_2$ are the two objects' weights, r is the
distance between them and G is the universal gravitational
constant. The formula used in this paper has been adapted
from [9], and the density value calculated in Section IV-A
has been used. The force of attraction between two buckets
$b_i$ and $b_j$ is

$$f_{ij} = \frac{d_i d_j}{dis_{ij}^2}$$

where $d_i$ and $d_j$ are the densities of the two buckets
$b_i$ and $b_j$, and $dis_{ij}$ is the distance between the two
buckets $b_i$ and $b_j$.
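
As a small numerical illustration, the force between two buckets can be computed as the product of their densities divided by the squared distance between them. The sketch below assumes that the distance is the Euclidean distance between the bucket centers of gravity; the function name and the handling of coincident centers are illustrative choices only.

```python
import math

def attraction_force(d_i, d_j, center_i, center_j):
    """Product of the two densities divided by the squared distance."""
    dist = math.dist(center_i, center_j)
    if dist == 0:
        return math.inf  # coincident centers: treated here as infinitely attractive
    return d_i * d_j / dist ** 2
```

For two buckets with densities 2 and 3 whose centers are 5 units apart, the force is 6 / 25 = 0.24.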

C. Calculation of gain 

Jung et al. proposed clustering gain as a measure of
clustering optimality. It is based on the squared error sum
as a clustering algorithm proceeds, and it is applicable to
both hierarchical and partitional clustering methods for
estimating the desired number of clusters. The authors
showed that the clustering gain has a maximum value at the
optimal number of clusters. In addition, clustering gain is
cheap to compute. Therefore, it can be computed at each step
of the clustering process to determine the optimal number
of clusters without increasing the computational complexity
[10]. The clustering gain can be computed from

$$\Delta = \sum_{j=1}^{K} (s_j - 1) \left\| o_0^{(j)} - o_0 \right\|_2^2$$

where $s_j$ is the number of data points in cluster $C_j$, K
denotes the number of clusters, and $o_0$ is the global
centroid, defined as

$$o_0 = \frac{1}{n} \sum_{i=1}^{n} o_i$$

where n is the total number of data points and $o_i$ are the
data points. $o_0^{(j)}$ denotes the centroid of cluster j,
which is defined as

$$o_0^{(j)} = \frac{1}{s_j} \sum_{i=1}^{s_j} o_i^{(j)}$$

where $o_i^{(j)}$ denotes the data points belonging to the
$j^{th}$ cluster.
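
The gain computation can be sketched as below, assuming the gain is the sum over all clusters of (cluster size minus one) times the squared Euclidean distance between the cluster centroid and the global centroid. The function names are hypothetical, and clusters are represented as lists of coordinate tuples.

```python
def centroid(points):
    """Mean of a list of equal-length coordinate tuples."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))

def clustering_gain(clusters):
    """Sum over clusters of (size - 1) * squared distance to the global centroid."""
    all_points = [p for c in clusters for p in c]
    global_centroid = centroid(all_points)
    gain = 0.0
    for c in clusters:
        cluster_centroid = centroid(c)
        squared = sum((a - b) ** 2 for a, b in zip(cluster_centroid, global_centroid))
        gain += (len(c) - 1) * squared
    return gain
```

Under this definition the gain is zero both when every point is its own cluster and when all points form a single cluster, so a maximum lies somewhere between these two extremes.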

D. The GLHL Algorithm 

This technique is based upon picking the highest density
leaf buckets and calculating the gravitational attraction
force with the next lower density leaf buckets. The buckets
with the maximum force of attraction are merged during each
iteration. The iterations continue until a single bucket
remains in the bucket set B. The algorithm uses the following
data structures:



B                  : set of buckets
D [1..n]           : density vector, where n is the number of data points
g [0..n]           : gain array
dis [1..n][1..n]   : distance array
F_1 [1..n][1..n]   : force array
c [1..n]           : number of clusters




Table 5.1 (Dataset)

Sr. no.   Number of clusters   Gain
1         12                   88.98880
2         11                   89.99901
3         10                   95.11324
4         9                    99.78969
5         8                    87.76554
6         7                    86.66780


All these data structures (B, D, g, dis and $F_1$) need to be
recalculated at each iteration because of the merger of
buckets. The pseudocode for the GLHL algorithm is given in
Algorithm 1.

Algorithm 1: Gravitational based hierarchical clustering
algorithm (GLHL Algorithm)

B = {b_i} : b_i = kd_tree_partitioning(dataset);
D = [d_i] : d_i = calculate_density(b_i);
dis = [dis_ij] : dis_ij = calculate_distance(b_i, b_j) ∀ buckets;
F_1 = [f_ij] : f_ij = calculate_force(b_i, b_j) ∀ buckets;
g[0] = calculate_gain(); // each data point as a cluster; the g vector stores the gain values
q = 1; // counter for iterations
Repeat
    N = |B|; // number of buckets
    If N = 1 then break; // only one bucket remaining
    merge(b_i, b_j) : f_ij = find_max(F_1);
    update B, D, dis, F_1;
    g[q] = calculate_gain();
    c[q] = N; // the c vector stores the number of clusters (buckets)
    q = q + 1;
Until false; // the loop exits through the break above
Return c[k] as the optimal number of clusters : g[k] = find_max(g);

The GLHL algorithm functions in four phases.
Phase 1 constitutes the initialization of the buckets by
kd-tree partitioning and of the data structures density D,
distance dis, gravitational force $F_1$ and the initial gain
g[0] for all n buckets, where n is the total number of data
points. Phase 2 checks the exit criterion that only one
bucket remains. Phase 3 finds the maximum entry $f_{ij}$ in
the $F_1$ matrix and merges the buckets $b_i$, $b_j$
corresponding to the (i, j) pair. Phase 4 updates B, D, dis,
$F_1$ and g, stores the number of clusters in c, and returns
the number of clusters k corresponding to the maximum
gain.
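
The four phases can be sketched end to end as follows. This is an illustrative sketch under several assumptions, not the C implementation used for the results: leaf buckets are passed in as (points, volume) pairs, density is taken as point count divided by volume, the volume of a merged bucket is assumed to be the sum of the two merged volumes, the distance between buckets is the distance between their means, and the bucket count is recorded after each merge together with the gain of the resulting partition. All names are hypothetical.

```python
import math

def mean(points):
    """Center of gravity of a list of coordinate tuples."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))

def gain(clusters):
    """Clustering gain: sum of (size - 1) * squared centroid-to-global distance."""
    o0 = mean([p for c in clusters for p in c])
    return sum((len(c) - 1) * sum((a - b) ** 2 for a, b in zip(mean(c), o0))
               for c in clusters)

def glhl(buckets):
    """buckets: list of (points, volume). Returns (best_k, history of (k, gain))."""
    buckets = [(list(pts), vol) for pts, vol in buckets]            # Phase 1
    history = [(len(buckets), gain([b[0] for b in buckets]))]
    while len(buckets) > 1:                                         # Phase 2
        best = None
        for i in range(len(buckets)):                               # Phase 3
            for j in range(i + 1, len(buckets)):
                (pi, vi), (pj, vj) = buckets[i], buckets[j]
                dist = math.dist(mean(pi), mean(pj))
                force = (math.inf if dist == 0
                         else (len(pi) / vi) * (len(pj) / vj) / dist ** 2)
                if best is None or force > best[0]:
                    best = (force, i, j)
        _, i, j = best
        # Assumption: a merged bucket keeps all points and the summed volume.
        merged = (buckets[i][0] + buckets[j][0], buckets[i][1] + buckets[j][1])
        buckets = [b for idx, b in enumerate(buckets) if idx not in (i, j)]
        buckets.append(merged)                                      # Phase 4
        history.append((len(buckets), gain([b[0] for b in buckets])))
    best_k = max(history, key=lambda entry: entry[1])[0]
    return best_k, history
```

For four unit-volume buckets holding two tight, well-separated pairs of points, the recorded gain peaks when two buckets remain, so `glhl` returns 2 as the number of clusters. The sketch recomputes all forces at every iteration for clarity, where an actual implementation would update the distance and force arrays in place.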

V. Results 

The GLHL algorithm was implemented in the C language
and validations were performed. The dataset used consists
of synthetic two-dimensional data points generated by mouse
clicks. The result is presented in Table 5.1. The rows in
bold font give the optimal number of clusters in each case,
indicated by the maximum gain value.

VI. Conclusion 

In this paper, a new gravitational based hierarchical
clustering algorithm using a kd-tree has been proposed. The
kd-tree structure is used to generate densely populated
packets. The clusters are formed by calculating the
gravitational force between the packets. To validate the
performance of our algorithm, we have used the concept of
gain, calculated as in Section IV-C. The validation results
have been presented in Section V.